20 research outputs found

    Toward Tweets Normalization Using Maximum Entropy

    Abstract: The use of social network services and microblogs, such as Twitter, has created valuable text resources, which contain extremely noisy text. Twitter messages contain so much noise that it is difficult to use them in natural language processing tasks. This paper presents a new approach using the maximum entropy model for normalizing Tweets. The proposed approach addresses words that are unseen in the training phase. Although the maximum entropy model needs a training dataset to adjust its parameters, the proposed approach can normalize data unseen in the training set. The principle of maximum entropy emphasizes incorporating the available features into a uniform model. First, we generate a set of normalized candidates for each out-of-vocabulary word based on lexical, phonemic, and morphophonemic similarities. Then, three different probability scores are calculated for each candidate using positional indexing, a dependency-based frequency feature, and a language model. After the optimal values of the model parameters are obtained in a training phase, the model can calculate the final probability value for candidates. The approach achieved an 83.12 BLEU score in testing using 2,000 Tweets. Our experimental results show that the maximum entropy approach significantly outperforms previous well-known normalization approaches.
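    The candidate-scoring step described above can be sketched as a log-linear (maximum entropy) model: each candidate's feature values are combined with learned weights and normalized over all candidates. The candidate words, feature values, and uniform weights below are hypothetical placeholders for illustration, not values from the paper.

    ```python
    import math

    def maxent_score(candidates, weights):
        """Log-linear scoring: P(c) = exp(sum_i w_i * f_i(c)) / Z,
        where Z normalizes over all candidates of the OOV word."""
        raw = [math.exp(sum(w * f for w, f in zip(weights, feats)))
               for feats in candidates.values()]
        z = sum(raw)
        return {cand: s / z for cand, s in zip(candidates, raw)}

    # Hypothetical feature values (positional indexing, dependency-based
    # frequency, language model) for candidates of the OOV token "2moro".
    candidates = {
        "tomorrow": (0.9, 0.8, 0.7),
        "to borrow": (0.2, 0.3, 0.4),
    }
    weights = (1.0, 1.0, 1.0)  # learned in training; uniform here for illustration
    probs = maxent_score(candidates, weights)
    best = max(probs, key=probs.get)
    ```

    The training phase mentioned in the abstract would fit the `weights`; here they are fixed so the sketch stays self-contained.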

    Regenerating Hypotheses for Statistical Machine Translation

    This paper studies three techniques that improve the quality of N-best hypotheses through an additional regeneration process. Unlike the multi-system consensus approach, where multiple translation systems are used, our improvement is achieved through the expansion of the N-best hypotheses from a single system. We explore three different methods to implement the regeneration process: redecoding, n-gram expansion, and confusion network-based regeneration. Experiments on Chinese-to-English NIST and IWSLT tasks show that all three methods obtain consistent improvements. Moreover, the combination of the three strategies achieves further improvements and outperforms the baseline by 0.8.
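    The confusion network idea can be illustrated with a toy per-slot vote over position-aligned hypotheses. This is only a sketch: real confusion networks align the N-best list with edit-distance or TER-style alignment and weight arcs by posterior scores, none of which is shown here.

    ```python
    from collections import Counter

    def consensus_from_nbest(hypotheses):
        """Toy confusion-network-style regeneration: assume the N-best
        hypotheses are already position-aligned, then pick the most
        frequent token at each slot."""
        tokenized = [h.split() for h in hypotheses]
        length = max(len(t) for t in tokenized)
        out = []
        for i in range(length):
            votes = Counter(t[i] for t in tokenized if i < len(t))
            out.append(votes.most_common(1)[0][0])
        return " ".join(out)

    # Hypothetical N-best list from a single system.
    nbest = [
        "the cat sat on the mat",
        "the cat sat in the mat",
        "the cat sits on the mat",
    ]
    consensus = consensus_from_nbest(nbest)
    ```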

    Linguistically annotated BTG for statistical machine translation

    Bracketing Transduction Grammar (BTG) is a natural choice for effective integration of desired linguistic knowledge into statistical machine translation (SMT). In this paper, we propose a Linguistically Annotated BTG (LABTG) for SMT. It conveys linguistic knowledge of source-side syntax structures to BTG hierarchical structures through linguistic annotation. From the linguistically annotated data, we learn annotated BTG rules and train a linguistically motivated phrase translation model and reordering model. We also present an annotation algorithm that captures syntactic information for BTG nodes. The experiments show that the LABTG approach significantly outperforms a baseline BTG-based system and a state-of-the-art phrase-based system on the NIST MT-05 Chinese-to-English translation task. Moreover, we empirically demonstrate that the proposed method achieves better translation selection and phrase reordering.
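    The core reordering mechanism of BTG can be sketched with its two merge rules: a straight rule that keeps the target order of two translated spans, and an inverted rule that swaps them. The example spans are invented for illustration; the paper's contribution is annotating these rules with source-side syntax, which is not modeled here.

    ```python
    def btg_merge(left, right, inverted):
        """BTG has two binary merge rules over translated spans:
        straight  A -> [A1 A2]  keeps target order,
        inverted  A -> <A1 A2>  swaps the two target spans."""
        return right + left if inverted else left + right

    # Toy example: an inverted merge, as when a source-language
    # modifier follows its head but precedes it in the target.
    straight = btg_merge(["red"], ["apple"], inverted=False)
    swapped = btg_merge(["red"], ["apple"], inverted=True)
    ```

    The LABTG approach would condition the choice of `inverted` on learned, syntax-annotated features rather than leaving it as a bare flag.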

    A syntax-driven bracketing model for phrase-based translation

    Syntactic analysis influences the way in which the source sentence is translated. Previous efforts add syntactic constraints to phrase-based translation by directly rewarding/punishing a hypothesis whenever it matches/violates source-side constituents. We present a new model that automatically learns syntactic constraints, including but not limited to constituent matching/violation, from the training corpus. The model brackets a source phrase as to whether it satisfies the learnt syntactic constraints. The bracketed phrases are then translated as a whole unit by the decoder. Experimental results and analysis show that the new model outperforms other previous methods and achieves a substantial improvement over the baseline, which is not syntactically informed.
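    The constituent matching/violation test that earlier methods reward or penalize directly can be sketched as a span-crossing check. The spans and constituent boundaries below are hypothetical; the paper's model learns such constraints from data instead of applying this hard rule.

    ```python
    def crosses_constituent(span, constituents):
        """Return True if a phrase span (s, e) crosses a constituent
        boundary: the two spans overlap but neither contains the other.
        Indices are inclusive word positions."""
        s, e = span
        for cs, ce in constituents:
            if s < cs < e < ce or cs < s < ce < e:
                return True
        return False

    # Hypothetical parse with constituents over words 0-2 and 3-5.
    constituents = [(0, 2), (3, 5)]
    violating = crosses_constituent((1, 3), constituents)  # straddles the boundary
    matching = crosses_constituent((0, 2), constituents)   # exact constituent
    ```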

    A phrase-based statistical model for SMS text normalization

    Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from the customization problem, as tremendous effort is required to adapt the language model of the existing translation system to handle SMS text style. We offer an alternative approach to resolve such irregularities by normalizing SMS texts before MT. In this paper, we view the task of SMS normalization as a translation problem from the SMS language to the English language, and we propose to adapt a phrase-based statistical MT model for the task. Evaluation by 5-fold cross validation on a parallel SMS normalized corpus of 5,000 sentences shows that our method can achieve a BLEU score of 0.80702 against the baseline BLEU score of 0.6958. Another experiment of translating SMS texts from English to Chinese on a separate SMS text corpus shows that using SMS normalization as MT preprocessing can largely boost SMS translation performance from 0.1926 to 0.3770 in BLEU score.
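    Treating normalization as phrase-based translation can be sketched with a toy monotone decoder that greedily replaces the longest matching SMS phrase using a phrase table. The table entries are hypothetical; the real model scores competing segmentations with translation and language-model probabilities rather than taking the first match.

    ```python
    def normalize_sms(text, phrase_table, max_len=3):
        """Toy monotone phrase-based normalization: scan left to right,
        greedily replacing the longest SMS phrase found in the table."""
        tokens = text.split()
        out, i = [], 0
        while i < len(tokens):
            for span in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i:i + span])
                if phrase in phrase_table:
                    out.append(phrase_table[phrase])
                    i += span
                    break
            else:
                out.append(tokens[i])  # pass unknown tokens through unchanged
                i += 1
        return " ".join(out)

    # Hypothetical phrase-table entries learned from a parallel SMS corpus.
    table = {"c u": "see you", "2moro": "tomorrow"}
    normalized = normalize_sms("c u 2moro", table)
    ```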